Development of Compound Clustering Techniques Using Hybrid Soft-Computing Algorithms
Abstract
Databases of molecular structures available to the pharmaceutical industry comprise millions of molecules. With the advent of combinatorial chemistry, a vast number of compounds can be made available either physically or virtually, which can make screening all of them infeasible in terms of time and cost. Therefore, only a subset of the database, one that encompasses the full range of structural types in the underlying dataset, needs to be selected for screening, so as to maximise the likelihood of finding as many biologically distinct active compounds as possible in a screening experiment. One of the most widely used compound selection methods is cluster-based compound selection, which involves subdividing a set of compounds into clusters and choosing one compound or a small number of compounds from each cluster. Selecting only representative compounds from each cluster rests on the assumption that structurally similar molecules have similar properties. A good clustering method groups similar compounds together, to ensure all activity classes are represented, whilst separating active and inactive compounds into different sets of clusters, to avoid an inactive compound being selected as a cluster representative. Hierarchical clustering methods such as Ward’s and Group Average are considered the industry standard for compound selection purposes. To date, there has been limited work on the clustering and classification of biologically active compounds into their activity-based classes using fuzzy and neural network methods. Furthermore, many biologically active molecular structures have been found to exhibit more than one activity, in which case they can be used as drugs for the treatment of more than one disease. However, previous clustering methods for chemical compounds are mostly limited to hard partitioning, which allows a compound to belong to only one cluster.
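The selection step of cluster-based compound selection can be illustrated with a minimal sketch. It assumes the clustering has already been done and only shows how one representative per cluster might be chosen. The Tanimoto coefficient, the medoid rule, and the names `tanimoto` and `select_representatives` are illustrative assumptions and are not taken from the report; fingerprints are modelled as sets of on-bit indices.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_representatives(fingerprints, labels):
    """Return {cluster label: index of the chosen compound}.

    Assumption: the representative is the medoid, i.e. the member with the
    highest summed similarity to all members of its own cluster.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    representatives = {}
    for label, members in clusters.items():
        representatives[label] = max(
            members,
            key=lambda i: sum(
                tanimoto(fingerprints[i], fingerprints[j]) for j in members
            ),
        )
    return representatives


# Hypothetical toy data: five compounds already assigned to two clusters.
fingerprints = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 4}, {10, 11}, {10, 11, 12}]
labels = [0, 0, 0, 1, 1]
reps = select_representatives(fingerprints, labels)
print(reps)
```

Only the selected compounds would then go forward to screening, which is why a cluster that mixes active and inactive compounds risks being represented by an inactive one.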
In this work, neural, fuzzy and hybrid methods are utilized for the clustering of biologically active molecular structures into their corresponding activity classes. The methods have been evaluated on MDL’s MDDR, NCI’s AIDS and IDDB drug databases, which contain various biologically active classes of molecular structures. The neural network methods use a number of heuristics to find appropriate parameter values. Initially, these heuristics needed user intervention to select the values, which gave poor results. To overcome this problem, fuzzy memberships have been employed to find optimal parameters. Since fuzzy clustering methods such as fuzzy c-means and fuzzy Gustafson-Kessel (G-K) are computationally demanding in terms of time and memory, a hierarchical approach has also been used in this work for their implementation. The hierarchical fuzzy clustering algorithm developed in this work assigns overlapping structures (structures having more than one activity) to more than one cluster if their fuzzy membership values are significantly high for those clusters. When compared with the industry standard methods, the neural networks show very poor performance when 2-D bit-string descriptors are used. However, their relative performance improves when topological indices are used as descriptors. The fuzzy and fuzzy neural methods show slightly better results than the industry standard methods. The hierarchical fuzzy clustering method developed here performs far better than a similar implementation of the hard k-means method, and its performance improves significantly when it is used for overlapping structures. Although the neural network methods are not very effective in clustering biologically active structures, their performance is remarkable when they are used as classifiers.
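The idea of assigning an overlapping structure to more than one cluster can be sketched with the standard fuzzy c-means membership formula, u_i = 1 / Σ_j (d_i / d_j)^(2/(m-1)), followed by a threshold on the memberships. This is a rough illustration rather than the report's hierarchical algorithm: the threshold value of 0.3, the Euclidean distance, and the names `fcm_memberships` and `assign_clusters` are assumptions, since the abstract only states that membership must be "significantly high".

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means memberships of one point with respect to fixed cluster centres.

    m > 1 is the fuzzifier; distances are Euclidean (an assumption here).
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centres belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
        for d_i in dists
    ]


def assign_clusters(memberships, threshold=0.3):
    """Return every cluster whose membership reaches the (hypothetical) threshold.

    Falls back to the single best cluster so that no compound is left unassigned.
    """
    chosen = [k for k, u in enumerate(memberships) if u >= threshold]
    if chosen:
        return chosen
    return [max(range(len(memberships)), key=memberships.__getitem__)]


# Hypothetical 2-D descriptor space with three cluster centres.
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 10.0)]
between = fcm_memberships((2.0, 0.0), centers)  # equidistant from centres 0 and 1
near = fcm_memberships((0.1, 0.0), centers)     # very close to centre 0
print(assign_clusters(between), assign_clusters(near))
```

A hard partitioning method such as k-means would force the first point into a single cluster, whereas the thresholded memberships keep it in both of the clusters it sits between.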
The feed-forward and radial basis function networks show higher learning capabilities than support vector machines and rough set classifiers in the classification of datasets comprising more than two classes. However, their performance is slightly inferior to that of support vector machines for the binary classification of chemical structures into drug and non-drug compounds.

IRPA 04-02-06-0093 EA001 Final Report

TABLE OF CONTENTS

Chapter 1. Introduction
1.1. Background of the Problem
1.2. Problem Statement
1.3. Objectives of the Study
1.4. Scope of the Study
1.5. Milestones of the Project
1.6. Research Framework
1.7. Research Contributions
1.8. Report Organization

Chapter 2. Chemoinformatics and Compound Clustering
2.1. Chemical Structures Representation
  2.1.1. Fragmentation Codes
  2.1.2. Linear Notations
  2.1.3. Connection Tables
2.2. Molecular Descriptors
  2.2.1. 2-Dimensional Descriptors
  2.2.2. 3-Dimensional Descriptors
2.3. Classical Clustering Methods
  2.3.1. Single Linkage Algorithm
  2.3.2. Complete Linkage Algorithm
  2.3.3. Group Average Algorithm
  2.3.4. Centroid Clustering Algorithm
  2.3.5. Median Clustering Algorithm
  2.3.6. Ward’s Clustering Algorithms
  2.3.7. Single Pass Algorithm
  2.3.8. Jarvis-Patrick Algorithm
  2.3.9. K-means Algorithm
2.4. Classification of Chemical Compounds

Chapter 3. Neural Networks and Compound Clustering
3.1. Unsupervised Neural Networks
  3.1.1. Kohonen Neural Network
  3.1.2. Neural Gas Network
  3.1.3. Enhanced Neural Gas (ENG) Algorithm
3.2. Supervised Neural Networks
  3.2.1. Multi-Layer Perceptron (MLP)
  3.2.2. Radial Basis Function (RBF) Network
3.3. Other Machine Learning Methods
  3.3.1. Support Vector Machines
  3.3.2. Rough Set Classifier
3.4. Experiments and Results
  3.4.1. Experiment 1
  3.4.2. Experiment 2
  3.4.3. Experiment 3
3.5. Summary

Chapter 4. Fuzzy Logic & Compound Clustering
4.1. Fuzzy Logic
4.2. Fuzzy Set Theory
4.3. Membership Function
4.4. Fuzzy Clustering
  4.4.1. Hard and Soft Clustering
  4.4.2. Fuzzy c-means
  4.4.3. Fuzzy Gustafson-Kessel Algorithm
  4.4.4. Modified Gustafson-Kessel Algorithm
4.5. Experiments and Results
  4.5.1. Experiment 1
  4.5.2. Experiment 2
4.6. Summary

Chapter 5. Hybrid Techniques and Compound Clustering
5.1. Fuzzy Kohonen Self-Organizing Feature Map
  5.1.1. Fuzzy Kohonen Network (FKN) Algorithm
5.2. A Hierarchical Fuzzy Algorithm
5.3. Validity of Clustering for Fuzzy Hierarchical Algorithm
5.4. Genetic Algorithm and Clustering
5.5. Experiments and Results
  5.5.1. Experiment 1
  5.5.2. Experiment 2
  5.5.3. Experiment 3
  5.5.4. Experiment 4
5.6. Summary

Chapter 6. Discussions and Conclusions
6.1. Research Objectives and their Achievements
6.2. Neural Networks and Chemical Structures
6.3. Fuzzy Clustering
6.4. Hybrid Clustering Methods
6.5. Conclusion

References